Smart decision support system for keratoconus severity staging using corneal curvature and thinnest pachymetry indices

Background This study proposes a decision support system created in collaboration with machine learning experts and ophthalmologists for detecting keratoconus (KC) severity. The system employs an ensemble machine learning model and minimal corneal measurements. Methods A clinical dataset is initially obtained from Pentacam corneal tomography imaging devices. The dataset undergoes pre-processing, and imbalanced sampling is addressed by applying an oversampling technique to the minority classes. Subsequently, a combination of statistical methods, visual analysis, and expert input is employed to identify the Pentacam indices most correlated with the severity class labels. These selected features are then utilized to develop and validate three distinct machine learning models. The model exhibiting the most effective classification performance is integrated into a real-world web-based application and deployed on a web application server. This deployment facilitates evaluation of the proposed system, incorporating new data and considering relevant human factors related to the user experience. Results The performance of the developed system is experimentally evaluated, and the results reveal an overall accuracy of 98.62%, precision of 98.70%, recall of 98.62%, F1-score of 98.66%, and F2-score of 98.64%. The application's deployment also demonstrated precise and smooth end-to-end functionality. Conclusion The developed decision support system establishes a robust basis for subsequent assessment by ophthalmologists before potential deployment as a screening tool for keratoconus severity detection in a clinical setting. Supplementary Information The online version contains supplementary material available at 10.1186/s40662-024-00394-1.


Background
Keratoconus (KC) is a degenerative condition that affects the cornea, the transparent layer at the front of the eye. It involves the gradual central thinning of the cornea, resulting in a conical or irregular shape and causing visual impairment [1]. KC affects both genders, typically manifests in early adolescence, and advances until the fourth decade of life. It affects both eyes asymmetrically and can markedly hinder vision, resulting in distorted vision, near-sightedness, and astigmatism [2]. The exact cause of KC is not fully understood despite decades of research. A mix of environmental and genetic factors is thought to influence the onset and progression of this disease [2][3][4][5].
The prevalence and incidence of KC vary in different communities around the world [6][7][8]. This can be attributed to the diversity of populations studied and the lack of specific guidelines for defining and classifying KC. Research indicates that, in comparison to other populations, the prevalence of KC is higher in Middle Eastern and South Asian populations. For example, the prevalence of KC in Iran has been increasing in recent years, from 1 in 126 in 2013 to 1 in 32 in 2018 [9][10][11]. Research studies conducted in the UK [12][13][14][15][16][17][18][19] revealed a notable variation in the prevalence of KC among individuals from different ethnic backgrounds. Most KC cases were identified in individuals of Indian descent within a community comprising 87% White and 11% Asian (comprising individuals of Indian, Pakistani, or Bangladeshi backgrounds). By examining screening data from hospital records, researchers identified 229 Asian patients and 57 White patients with KC. The researchers concluded that there was a four-fold increase in the prevalence of KC among Indians and similar Asian communities, underscoring the significant ethnic component of the disease. Most of these prevalence studies were conducted on patients in hospitals or clinics, where it was easier to gather data. This likely underestimates the disease prevalence since patients are frequently asymptomatic, making it easier to overlook the earlier and more subtle manifestations of the disease [20]. The true prevalence of KC, however, can be determined more accurately by population-based screening studies.
Management of KC is challenging because this disease can be undetectable at its early stages, and standard eyeglasses or contact lenses may allow good visual acuity. Early diagnosis of KC is therefore important to manage symptoms related to reduced visual acuity and astigmatism as well as to prevent disease progression. Additionally, management of KC depends on the disease's stage and involves non-surgical and surgical options [21]. Non-surgical options are usually recommended in the early stages. These include advising patients to avoid eye rubbing as well as correction of vision. Spectacles and soft contact lenses are typically used in the early stages to correct near-sightedness, far-sightedness, and astigmatism. Rigid contact lenses are used for more progressive disease stages with irregular astigmatism [22]. Although corrective glasses and lenses can correct the refractive error, they do not halt disease progression. Current practice is to proceed to corneal cross-linking for KC cases that are progressive or expected to progress. More advanced stages are managed surgically with a corneal ring implant or corneal transplantation (also known as keratoplasty), including partial-thickness keratoplasty or, for severe conditions, full-thickness (penetrating) keratoplasty.
Corneal collagen cross-linking was approved by the US Food and Drug Administration (FDA) in 2016 and involves the application of a vitamin B2 (riboflavin) solution as a photosensitizer to the eye and ultraviolet light (UV-A) at a wavelength of 370 nm [23]. New collagen bonds form, restoring and preserving the cornea's strength and flat spherical shape. Clinical trials show these changes persist for up to 7 years post-initial treatment [24]. Another option is implanting a corneal ring, which involves placing a C-shaped ring inside the corneal stroma to flatten the cornea's surface. This reduces astigmatism, which results in improved visual acuity. Corneal transplant is a highly effective surgical option in which a donor cornea replaces the patient's damaged cornea. Studies show an excellent 5-year graft survival rate with more than 90% of patients having a corrected visual acuity of 6/12 or better [25]. However, most patients still need glasses or contact lenses to achieve optimal vision after keratoplasty.
The diagnosis of KC typically relies on a combination of medical history, physical exam (including optometric refractive assessment, retinoscopy, and slit-lamp biomicroscopy), and corneal imaging studies. Techniques commonly used to obtain images of the cornea are corneal topography, tomography, and optical coherence tomography (OCT) [26]. Corneal topography is a special technology that maps the surface of the cornea in terms of elevation and curvature aspects of both the anterior and posterior surfaces. OCT provides high-resolution cross-sectional scans of the cornea and ocular surface. Each tool has a set of parameters that are used to provide data to aid in KC diagnosis.
In recent years, machine learning (ML), a branch of artificial intelligence, has evolved as a promising tool for aiding the identification and diagnosis of complex conditions [27,28], including KC. Numerous supervised and unsupervised ML methods have been proposed for the diagnosis of KC. Supervised methods were trained with labelled input data to detect KC from unlabelled input data [29], while unsupervised learning used ML algorithms to identify patterns or clusters in the data [30]. Deep learning, a sub-branch of ML designed for processing large datasets [31], has also been proposed for KC detection, and is especially adept at segmenting or classifying corneal images [32]. These techniques were used to assess a wide range of parameters that were obtained from corneal imaging devices as well as other clinical and biometric variables to detect KC [33]. When given corneal topography, tomographic data, or a combination of both, many of these methods effectively distinguish between two or more classes [34].
In the context of KC severity, studies that used ML algorithms to divide KC corneas into distinct clinical stages have relied on a range of grading schemes. In the studies of Bolarín et al. [35] and Velázquez-Blázquez et al. [36], the authors graded corneas into grades I-V, employing a classification system based on corrected distance visual acuity (CDVA). In [37], the authors graded corneas as 1-4 using the Amsler-Krumeich (AK) classification system, which is primarily centered on keratometry but also incorporates refraction and pachymetry [38].
Another study [39] categorized KC corneas into mild and moderate stages through a self-defined classification scheme. Numerous studies have presented diverse ML models to predict KC severity. However, there is no consensus on a standardized set of parameters applicable for diagnosing KC or predicting its severity [40]. This is possibly caused by the use of various diagnostic criteria and imaging instruments, and by a lack of readily available datasets that can function as a reference for predicting KC severity levels [33]. Moreover, most of these studies were conducted in an academic research setting [41], rather than being applied in clinical practice [42,43]. This challenge arises from ineffective communication between clinicians and system developers, leading to caution in relying solely on ML predictions without supplementary clinical validation.
In contrast to prior studies on KC severity classifications, this study proposes a real-world decision support system that is collaboratively developed by both ML experts and ophthalmologists. The proposed system, utilizing an ensemble machine learning model and three Pentacam corneal indices, aims to assess KC severity in a timely manner, before visual impairment occurs. A user-centered, iterative development methodology [44] is employed to build the proposed system, ensuring the ongoing engagement of potential end-users (ophthalmologists) throughout the development process. A transparent approach based on expert opinion is adopted for feature selection, model development and validation tests. This facilitates regular updates to models based on new data and continuous monitoring of the system's performance. The primary contributions of this study include: (i) a comprehensive approach to collecting and pre-processing a raw clinical dataset, (ii) the proposal of a severity staging system (0-4) based on only three corneal tomography parameters, (iii) the development and evaluation of multiple classification models capable of detecting various levels of KC severity, and (iv) the creation and deployment of a real-world online decision support system. This system aims to standardise the diagnostic criteria for KC severity across multiple eye-care facilities, thereby reducing the potential for human error, especially in geographical regions lacking specialist ophthalmologists. This research extends the outcome of an earlier study [45], carried out by the authors, which focused on the classification between normal and KC corneas. In this work, the emphasis is specifically on classifying various severity stages of KC.

System overview
The primary objective of the proposed system is to aid general practitioners, particularly those located in underserved geographical areas, in the screening for KC severity. Figure 1 depicts a streamlined workflow diagram illustrating the interaction between the user and the system, briefly outlined as follows: The user manually collects several corneal indices from a Pentacam imaging device and submits them to the system through a browser on a computing device, such as a laptop, tablet, or smartphone. The Flask web framework receives and processes the user's request. In response to this request, the Flask framework manages the input and produces a predicted KC severity level based on the received set of corneal indices.
The detection of the severity stage is performed by an ML model aided by an SQLite database that functions as a repository for user inputs, associated predictions, and user access credentials. This information can later be utilized for tracking disease progression and as additional training data to enhance the prediction accuracy. Subsequently, the web server communicates the prediction result to the user by delivering it to the user's browser, which then presents the result on the screen of the computing device in use.
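As an illustration of the repository role described above, the following minimal sketch stores one submitted set of indices together with its predicted stage using Python's built-in sqlite3 module. The table name, column names, and example values are hypothetical and are not taken from the deployed system; user credentials are deliberately left out of the sketch.

```python
import sqlite3

def init_db(conn):
    # Hypothetical schema: three corneal indices plus the predicted stage (0-4).
    conn.execute(
        """CREATE TABLE IF NOT EXISTS predictions (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               rm_b REAL NOT NULL,
               rm_f REAL NOT NULL,
               pachy_min REAL NOT NULL,
               stage INTEGER NOT NULL CHECK (stage BETWEEN 0 AND 4),
               created_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )

def save_prediction(conn, rm_b, rm_f, pachy_min, stage):
    # Parameterized query, so user-supplied values are never interpolated into SQL.
    cur = conn.execute(
        "INSERT INTO predictions (rm_b, rm_f, pachy_min, stage) VALUES (?, ?, ?, ?)",
        (rm_b, rm_f, pachy_min, stage),
    )
    conn.commit()
    return cur.lastrowid

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
init_db(conn)
row_id = save_prediction(conn, 5.1, 6.4, 455.0, 2)  # made-up example values
```

Rows stored in this way could later be queried per patient to follow progression, or exported as additional training data.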

Development methodology
The key phases in the development methodology of the proposed severity staging predictor are shown in Fig. 2. The process starts with the extraction of the study dataset from Pentacam [46]. Pentacam is a corneal imaging device incorporating a slit illumination system and a camera that rotates jointly around the eye. The slit illuminates a thin layer within the eye, and due to their lack of complete transparency, the cells scatter the slit's light. Next, the collected data is pre-processed and labelled by a team of ophthalmologists.
A subset of several indices (features) is then identified to differentiate between the different severity levels of the disease. The identified features are then employed to create ML models that are pipelined (Fig. 2). It is worth noting that the classifying model of normal/KC corneas, as previously detailed by the authors [45], is beyond the scope of this paper. This study specifically focuses on the severity staging classifier (KC severity predictor). To enhance accessibility and standardize the diagnostic criteria across multiple eye-care facilities, a web interface was built and utilized to deploy the developed severity predictor on a web application server. The development methodology for the proposed system is presented and discussed later in the ML modelling section.

Study dataset
The dataset utilized in this study was collected over the preceding decade from two eye-care centers in Jordan: Jordan University Hospital (JUH) and Al-Taif Eye Center (ATEC). Ethical approval for the study was obtained from the Ethics Committees at both healthcare facilities (Protocols: JUH-2023-1593/67 and ATEC-GM/15). The dataset consisted of patients with a diagnosis of KC in one or both eyes. Diagnosis was established through clinical, optometric, and ophthalmic examinations, including slit-lamp assessment, retinoscopy, and corneal tomography data. The collected dataset, comprising 79 feature columns linked to 644 corneas with different severity stages, is shown in Fig. 3.
As illustrated, the dataset samples exhibit an imbalanced sample distribution among the various stages of KC severity. This imbalance, which is common in medical research [47,48], can lead to biased classification. Consequently, it is imperative to address this concern prior to training ML models to prevent potential biases in both training and classification performance.

Pre-processing
In this study, several pre-processing procedures were applied to the raw data to enhance its quality, thereby improving the performance of the feature selection and ML modelling processes. These procedures are shown in Fig. 4 and are detailed as follows.

Data cleaning
Table 1 outlines the steps that are applied to the raw dataset, resulting in a reduction of feature columns from 79 to 58. Handling poor-quality data is essential in ML modelling; the Expectation-Maximization (EM) algorithm [49,50] is one of the widely used iterative methods for finding maximum likelihood or maximum a posteriori estimates of parameters in statistical models. However, in the collected study dataset, the feature columns containing incomplete data were found to be irrelevant to the intended diagnosis, and thus were identified and safely filtered out with the aid of expert ophthalmologists.
Identifying outliers often requires statistical methods or domain expertise [50]. Common approaches include standard deviation, median absolute deviation, z-score, boxplot and ML techniques like clustering and anomaly detection algorithms. The boxplot [51], which relies on the interquartile range (IQR), is adopted in this study due to its interpretability and effectiveness in identifying outliers within small datasets [52]. Its strength lies in its resilience against extreme values, offering a more reliable measure than methods relying solely on mean or standard deviation. This is particularly beneficial for small datasets where outliers can disproportionately influence these traditional measures. Outliers are identified as observations falling below a lower bound = Q1 − k × IQR or above an upper bound = Q3 + k × IQR, where k = 1.5, and Q1 and Q3 represent the first and third quartiles, respectively [53].
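The IQR rule stated above can be sketched in a few lines of standard-library Python. The quartile convention (`method="inclusive"`) and the sample readings are assumptions made for illustration; other quartile conventions give slightly different bounds, and the study does not state which one was used.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    # Assumption: 'inclusive' quartiles, i.e. the data are treated as the full population.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, k=1.5):
    """Return the observations that fall outside the IQR bounds."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if v < lower or v > upper]

# Hypothetical thinnest-pachymetry readings in micrometres (not study data).
pachy = [450, 460, 470, 480, 490, 500, 510, 300]
outliers = find_outliers(pachy)
```

Because the bounds depend only on the quartiles, a single extreme reading such as 300 does not shift them the way it would shift a mean-based threshold.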

Feature transformations
Several feature transformation techniques are implemented on the study dataset, encompassing the encoding of categorical data, skew transformation, and feature scaling. These techniques are briefly described as follows.
Feature encoding involves the conversion of non-numeric values to numeric values, a process commonly applied to categorical features representing qualitative data without inherent mathematical meaning. While easily comprehensible to humans, such data poses challenges for computers. Consequently, all categorical data are transformed into numerical data types. Binary or one-hot encoding (0, 1) is employed for nominal (categorical, unordered) features, while ordinal encoding (1, 2, …, n) is utilized for ordinal (categorical, ordered) features.
In this study, the latter two scaling methods, which can normalize both positive and negative feature values to be within the range of −1 and +1 (consistent with the characteristics of the study dataset), are explored. Results indicated that both techniques exhibit comparable performance in most cases, with the standard method slightly outperforming in the remaining instances, and thus the standard scaling method was adopted.

Labelling severity stages
A team of specialist ophthalmologists labelled the collected subjects using clinical examinations, slit-lamp assessments, and corneal topography data from Pentacam imaging devices. Pentacam exhibits the highest repeatability, establishing its effectiveness as a tool for KC severity classification and monitoring KC progression [42]. After applying the labelling criteria, the study subjects were categorized into five severity stages (0-4). Concise definitions for these stages are outlined in Table 2, accompanied by a representative image of the Sagittal curvature (front) corresponding to each level.

Balancing class sampling
Addressing the uneven distribution within a dataset can be approached through various methods, such as oversampling minority classes, undersampling majority classes, or employing a combination of both strategies. In this study, the latter approach was adopted as follows. For the severity staging, where the available number of samples was relatively limited, the minority class samples for Stage 3 and Stage 4 were oversampled to achieve a reasonable balance with the samples from the remaining classes of Stage 0 to Stage 2. This is accomplished through the application of the Synthetic Minority Oversampling TEchnique (SMOTE). SMOTE is known for its simplicity and effectiveness in addressing imbalances in small-sized datasets [57][58][59]. It generates synthetic data points along the line segment between a randomly selected data point and one of its K-nearest neighbours.
Following the implementation of SMOTE, the minority classes of stages 0, 1, 3, and 4 were augmented to match the majority class samples (174) of Stage 2. As a result, the dataset was boosted from 644 to 870 samples, with 174 samples per class.Figure 5 presents a comparison between the real samples (left columns) and augmented ones (right columns) in each stage.These adjustments were anticipated to enhance the training and classification performance of the proposed models and mitigate the adverse effects of a small sample size.
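The interpolation step at the core of SMOTE, as described above, can be sketched as follows. This is a simplified standard-library illustration rather than the implementation used in the study; the function names, the neighbour count `k`, and the toy points are assumptions.

```python
import math
import random

def smote_sample(minority, k, rng):
    """Create one synthetic point from a list of minority-class feature vectors.

    Requires at least two minority points.
    """
    base = rng.choice(minority)
    # k nearest neighbours of the base point (excluding itself), by Euclidean distance.
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment base -> neighbour, in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]

def oversample(minority, target_size, k=5, seed=0):
    """Append synthetic points until the class reaches target_size samples."""
    rng = random.Random(seed)
    n_new = target_size - len(minority)
    return minority + [smote_sample(minority, k, rng) for _ in range(n_new)]

# Toy two-feature minority class (not study data), grown from 4 to 10 samples.
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
balanced = oversample(toy_minority, target_size=10, k=2)
```

Every synthetic point lies on a segment between two real minority samples, so it stays within the region already occupied by that class. Growing each of the five stages to 174 samples in this way gives the 5 × 174 = 870 samples reported above.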

Feature selection
The proposed feature selection process involved analysis of feature-relative importance and feature dependency using a combination of expert opinion, probability, and visual methods.

Feature dependency
Certain features, which either directly or indirectly rely on primary features, have been identified with the aid of ophthalmologists. These features include [60]:
• RSagMin depends on R_Min (mm).
After filtering these features and others, the feature set was reduced from 58 to 40 features.

Feature relative importance
In ML, feature importance entails assigning scores to input features in a predictive model, indicating their relative significance in the prediction process. These scores are relevant to both regression problems, focused on predicting numerical values, and classification problems, where the objective is to predict class labels, as is the case in this study. It should be mentioned here that feature importance is a relative measure within the context of the model and the specific dataset used for training. In practical applications, various ML libraries, including the scikit-learn library in Python, offer a "feature importance" attribute once a random forest (RF) classification model has been trained. In this model, a common method (called Gini) was utilised for calculating the feature importance scores. It is based on the Gini impurity reduction achieved by each feature. Although Gini impurity is not a conventional statistical test, it is a concept rooted in probability and information theory. This concept finds extensive application in ML, particularly in the construction of decision trees and the evaluation of feature importance within RF. The Gini method was applied to the remaining 40 features, resulting in their prioritization based on importance scores (Fig. 6).
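To make the notion of Gini impurity reduction concrete, the sketch below computes the impurity of a set of class labels and the decrease obtained by one candidate split. It illustrates the quantity that tree-based importance scores accumulate per feature; it is not the scoring code used in the study, and the label lists are made up.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

# Toy node holding two severity stages, split perfectly by some feature threshold.
parent = [0, 0, 1, 1]
decrease = gini_decrease(parent, left=[0, 0], right=[1, 1])
```

In a random forest, decreases of this kind are summed over all nodes that split on a given feature and averaged over the trees, which is why the resulting scores are meaningful only relative to one another.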
The features with the top three scores are selected and employed in this study to create different ML models aimed at detecting distinct stages of KC severity. These features are: (i) the corneal posterior radius of curvature, Rm_B (mm), (ii) the anterior radius of curvature, Rm_F (mm), and (iii) the thinnest pachymetry, Pachy_Min, which attained relative importance scores of 0.938, 0.745, and 0.734, respectively. These scores serve as a valuable tool for identifying and prioritizing features based on their significance in the classification task (i.e., KC severity staging). Other features with slightly lower scores were often dependent on or derived from these core indices. For instance, the average pachymetry on the concentric ring with a radius of 0 mm (D0mm_pachy) around the thinnest point of the cornea is technically the same as Pachy_Min, and thus it was excluded to maintain clarity and prevent redundancy. It should be noted here that all the selected features were derived from a single corneal imaging device (Pentacam).

Visualisation
To better understand the relationships between the identified top features, the Python library Seaborn was utilised to generate multiple pairwise bivariate distributions using a pair plot (Fig. 7). This plot enables the visualization of individual feature distributions and the relationships between two features in the dataset. The univariate histograms for every feature were generated in the diagonal plots to illustrate the marginal distribution of the data in each column. Examining the diagonal as well as non-diagonal relationships between features helped to identify which feature pair provides the best separation between the target classes (i.e., severity stages). As illustrated, Rm_B (mm) is more effective in separating the different severity classes than Rm_F (mm) and Pachy_Min. This validates the significance of the selected features.

Machine learning modelling
A user-centered, iterative approach [44] was applied in the development of the proposed system, ensuring the continuous involvement of potential end users throughout the process. Figure 8 illustrates a simplified flow diagram of this process, with its distinct phases briefly described as follows.

Model selection
To establish the end-to-end configuration and validate the concept of the proposed ML solution, simple models can be utilized. This helps prevent excessively complex designs, reduces the time it takes to implement a solution [43], and may mitigate the potential risk of overfitting. Following the pre-processing of the dataset and identifying the most relevant subset of features for the target variable (i.e., severity stages), a classification model was chosen. This selection was made through experimentation and performance comparisons of three popular ML models in KC detection including severity staging. These models were logistic regression (LoR), support vector machines (SVM), and ensemble RF. These models are implemented using the Anaconda Jupyter notebook [61]. The fundamental principles underlying these classification models are briefly described as follows.

Logistic regression (LoR) classifier
It is a probabilistic classification model that employs the Sigmoid function and limits the probability values to a range between 0 and 1. If the predicted value exceeds a specified threshold, the event is considered more likely to occur, while if it falls below the threshold, it is deemed less likely to occur [62]. However, to apply LoR to multi-class classification, we utilized an extension known as multinomial LoR. This extension provided native support for addressing the five-class severity staging under investigation.
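The multinomial extension replaces the single Sigmoid output with a softmax over one linear score per class. The sketch below shows this prediction step only; the weights and biases are arbitrary placeholders rather than fitted parameters, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Convert per-class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_stage(features, weights, biases):
    """Return (most probable class index, class probabilities) for one sample."""
    scores = [sum(w * x for w, x in zip(ws, features)) + b
              for ws, b in zip(weights, biases)]
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__), probs

# Placeholder parameters for five classes and three scaled features (not fitted values).
weights = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 1.0, 0.0],
           [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]
biases = [0.0] * 5
stage, probs = predict_stage([0.1, 0.2, 2.0], weights, biases)
```

With two classes the softmax reduces to the Sigmoid function, so the binary model is a special case of this formulation.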

Support vector machine (SVM) classifier
It divides the various classes within the training set into groups using a surface that maximizes the margin between each class. The objective of SVM classification is to create lines that effectively partition the data points. The aim is to identify the optimal line, i.e., one that maximizes the margin between the classes [63,64]. SVM is well suited for binary classification problems, but for multi-class challenges, a technique known as "one-versus-one" (OVO) is employed, wherein each class is matched against every other class. In the final stages of classification, during the testing phase, each pairwise classifier casts a single vote for its predicted class. The class assigned to the test sample is then determined by the highest number of votes.
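The OVO voting scheme can be sketched independently of the underlying SVMs. In the illustration below, each pairwise classifier is replaced by a toy rule on a single scalar score, which is purely an assumption made to keep the example self-contained.

```python
from collections import Counter
from itertools import combinations

def ovo_predict(sample, classes, pairwise):
    """Let every pairwise classifier vote once; return the class with most votes."""
    votes = Counter()
    for pair in combinations(classes, 2):
        votes[pairwise[pair](sample)] += 1
    return votes.most_common(1)[0][0]

stages = [0, 1, 2, 3, 4]

# Toy stand-in for a trained binary SVM: pick whichever of the two stages is
# numerically closer to a scalar score (hypothetical, for illustration only).
pairwise = {
    (a, b): (lambda x, a=a, b=b: a if abs(x - a) <= abs(x - b) else b)
    for a, b in combinations(stages, 2)
}

predicted = ovo_predict(2.2, stages, pairwise)
```

For the five severity stages, OVO requires 5 × 4 / 2 = 10 binary classifiers, each trained only on the samples of its two classes.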

Ensemble random forest (RF) classifier
It employs an ensemble approach, combining individual decision tree learners into a "forest" to enhance overall strength while maintaining a balance between robustness and prediction accuracy [45]. The process involves generating numerous trees, and for each tree within the training set, the bootstrap aggregation (bagging) method is employed. Every tree in the forest receives input from the categorization algorithm, contributing a separate vote for each class. The ultimate class determined by the RF is the one with the highest vote count [65]. Furthermore, the RF maintains some distinction at each node when splitting similar features [66,67].
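Two mechanical ingredients of bagging, drawing a bootstrap sample for each tree and combining the trees' votes, can be sketched as follows. The decision trees themselves are omitted, and the function names and toy votes are assumptions.

```python
import random
from collections import Counter

def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement; the rest are 'out-of-bag'."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)
in_bag, out_of_bag = bootstrap_indices(20, rng)

# Hypothetical votes from five trees for one cornea.
forest_prediction = majority_vote([2, 2, 3, 1, 2])
```

The out-of-bag samples of each tree were never seen by that tree during training, which is what allows an out-of-bag error to be estimated without a separate validation set.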

K-fold training and validation
This study utilizes k-fold cross-validation to reduce the influence of the specific selection of test and training data on model evaluation. It involves creating non-repetitive subsets from the training data. The study dataset was divided into six folds based on the optimal performance observed across various k-fold divisions. Specifically, five folds (83.33%) were utilized for training, and the remaining fold (16.67%) was reserved for validation. This iterative process was repeated six times, with a distinct fold designated for validation in each iteration, as illustrated in Fig. 9. The trained classifier was subsequently tested and validated using evaluation metrics, and the results were averaged over four runs. The average performance is calculated using Eq. 1, as follows:
$$\text{Average performance} = \frac{1}{k}\sum_{i=1}^{k}\text{Performance}(i) \quad (1)$$
where Performance(i) is the value of a given metric in iteration i and k is the number of iterations.
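The six-fold partitioning and the averaging of per-fold results can be sketched as follows. The sketch splits consecutive indices without shuffling or stratification, which is a simplification; the study does not state how samples were assigned to folds, and the fold scores shown are made up.

```python
def k_fold_indices(n_samples, k=6):
    """Yield (train_idx, val_idx) pairs; each sample is used for validation exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

def average_performance(scores):
    """Mean of the per-fold metric values."""
    return sum(scores) / len(scores)

folds = list(k_fold_indices(870, k=6))  # 870 samples after balancing
mean_score = average_performance([0.98, 0.99, 0.97, 0.99, 0.98, 0.99])  # hypothetical scores
```

With 870 samples and six folds, each iteration validates on 145 samples and trains on the remaining 725, matching the 16.67% / 83.33% split described above.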

Hyperparameter tuning
In RF, the number of estimators (n-estimators) serves as a crucial hyperparameter for bagging trees. Thus, minimizing the out-of-bag error involves tuning this parameter. The process began with the use of two trees, and more were gradually added until the out-of-bag error stabilized at a specific minimum number of trees. In this experiment, both the model with the selected 3-feature subset and the 40-feature set were employed to determine the optimal number of trees. As depicted in Fig. 10, the optimum number of trees was 150 for the 40-feature set and 50 for the 3-feature subset, beyond which the out-of-bag error curve flattens. Notably, utilizing the selected feature subset resulted in a reduction of 66.66% in the number of trees. Similarly, the model's training time was also reduced by less than 30% compared to the time required for the 40-feature set.
Fig. 8 The development process method
The tuning of both the number of trees and other parameters of the RF model was also explored through two distinct methods: GridSearchCV (GSCV) and RandomizedSearchCV (RSCV). GSCV exhaustively explores a prespecified set of values within the targeted model's hyperparameter range [68,69], while RSCV uses a probability distribution to assign a value to each hyperparameter individually [70], making it notably faster than GSCV. However, the results obtained from the GSCV method exhibited greater consistency with the number of estimators obtained from Fig. 10, resulting in enhanced performance. The parameters of both the LoR and SVM classifiers were also tuned using both the GSCV and RSCV methods. Likewise, the parameters tuned by GSCV for both classifiers resulted in better performance compared to those obtained by the RSCV method. As a result, GSCV was employed to fine-tune the parameters of all the implemented models. The main parameters of the implemented models are given in the Appendix (Tables A.1, A.2 and A.3).
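The difference between the two search strategies lies in how candidate settings are generated, which the standard-library sketch below illustrates. The search space is hypothetical (the tuned values used in the study are those in the Appendix), and the random variant here samples from lists of values, a simplification of sampling from probability distributions.

```python
import random
from itertools import product

def grid_candidates(space):
    """Exhaustive search: every combination of the listed values."""
    names = list(space)
    return [dict(zip(names, combo)) for combo in product(*(space[n] for n in names))]

def random_candidates(space, n_iter, seed=0):
    """Randomized search: n_iter settings, each hyperparameter drawn independently."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in space.items()}
            for _ in range(n_iter)]

# Hypothetical search space for a random forest.
space = {
    "n_estimators": [50, 100, 150],
    "max_depth": [None, 5, 10],
    "criterion": ["gini", "entropy"],
}

grid = grid_candidates(space)                    # 3 * 3 * 2 = 18 settings
sampled = random_candidates(space, n_iter=5)     # only 5 settings are evaluated
```

Each candidate would then be scored by cross-validation. The grid grows multiplicatively with every added hyperparameter, whereas the cost of the randomized search is fixed by `n_iter`, which explains its speed advantage and its less exhaustive coverage.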

Results
A confusion matrix is a commonly used graphic for evaluating the performance of a classification model and is employed to assess the effectiveness and robustness of the developed models by comparing the ground truth (target) classes with the predicted classes. The performance metrics derived from the confusion matrices of Fig. 11, which are utilized to assess the performance of the created models, are computed using Eqs. 2, 3, 4, 5 and 6 as follows, where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
Accuracy: the ratio of accurate predictions to the total number of input samples, calculated as:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (2)$$
Precision: the average percentage of the actual positive cases among the retrieved instances, calculated as:
$$\text{Precision} = \frac{TP}{TP + FP} \quad (3)$$
Sensitivity (or Recall): the percentage of actual positive cases that were correctly predicted, calculated as:
$$\text{Recall} = \frac{TP}{TP + FN} \quad (4)$$
F1-score: the sensitivity and precision of the system are both considered in the calculation of this score:
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (5)$$
F2-score: the precision- and sensitivity-weighted harmonic mean (given a threshold value), calculated as:
$$F_2 = \frac{5 \times \text{Precision} \times \text{Recall}}{4 \times \text{Precision} + \text{Recall}} \quad (6)$$
In contrast to the F1-score, which assigns equal importance to precision and sensitivity, the F2-score diminishes the significance of precision while amplifying the importance of sensitivity. As a result, it places greater emphasis on minimizing FN rather than minimizing FP. Table 3 presents the average performance outcomes for predicting the severity stages in the developed models. As evident, the RF model exhibited superior performance compared to both the SVM and the LoR. Therefore, in the context of distinguishing between different levels of KC severity, the ensemble RF model was employed as a predictor within the proposed system.
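The evaluation metrics described in this section can be computed directly from a multi-class confusion matrix, as in the sketch below. It assumes that rows hold the actual classes and columns the predicted classes, and that precision and recall are macro-averaged over the classes; the averaging scheme used in the study is not stated, and the example matrix is made up.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the F1-score and beta=2 the F2-score."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0

def macro_metrics(cm):
    """Accuracy plus macro-averaged precision, recall, F1 and F2 from a square matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[c][c] for c in range(n)) / total
    precisions, recalls = [], []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp  # other classes predicted as c
        fn = sum(cm[c]) - tp                       # class c predicted as something else
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    precision = sum(precisions) / n
    recall = sum(recalls) / n
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f_beta(precision, recall, 1), "f2": f_beta(precision, recall, 2)}

# Made-up two-class confusion matrix (rows: actual, columns: predicted).
metrics = macro_metrics([[2, 0], [1, 1]])
```

As a consistency check, inserting the precision (98.70%) and recall (98.62%) reported in the abstract into `f_beta` reproduces the reported F1-score of 98.66% and F2-score of 98.64% after rounding.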

Model deployment and improvement
To assess the developed model in a real-world setting, it needs to be incorporated into the necessary software infrastructure for execution.This process encompasses integration, monitoring, and updates post-initial deployment.The integration of the model comprises two essential tasks: setting up the infrastructure for model execution and implementing the model itself.To achieve this, a lightweight Flask web framework [71] was employed to construct the interface essential for incorporating the developed KC predictor.Flask facilitates the development of online applications using Python, equipped with various libraries and frameworks, especially suitable for projects involving artificial intelligence.The primary resources of Flask utilized to craft the web interface for the proposed system are depicted in Fig. 12 and briefly outlined in Table 4.
The ML community still faces challenges in monitoring and updating ML systems [76]. For example, it is still unclear which data and model metrics are most important to track and how to raise alarms when abnormal behaviour is detected [42]. The optimal methods for monitoring changing input data, addressing prediction bias, and evaluating the overall performance of ML models also remain unclear. Furthermore, keeping the model consistent with the latest developments in data and the environment often requires the ability to update it after the initial deployment. Several methods exist for updating models with new data, including continuous learning and regularly scheduled retraining. A crucial factor influencing the frequency and quality of the model update process is concept drift, commonly referred to as dataset shift [77].
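One simple way to watch for dataset shift after deployment is to compare the mean of each input feature in newly collected data against the training data. The sketch below is only an illustration of that idea: the function name, the use of a standardized mean shift, and the 0.5 standard deviation threshold are all assumptions, not a monitoring method described in this study.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def drift_report(
    reference: Dict[str, Sequence[float]],
    incoming: Dict[str, Sequence[float]],
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Return features whose mean moved by more than `threshold` reference SDs.

    `reference` holds training-time values per feature, `incoming` holds values
    collected after deployment. The threshold is an arbitrary illustrative choice.
    """
    flagged = {}
    for name, ref_values in reference.items():
        ref_sd = stdev(ref_values)
        if ref_sd == 0:
            continue  # constant reference feature: a standardized shift is undefined
        shift = abs(mean(incoming[name]) - mean(ref_values)) / ref_sd
        if shift > threshold:
            flagged[name] = shift
    return flagged
```

A flagged feature would not by itself prove that the model has degraded; it is a prompt to inspect the new data and, where appropriate, schedule retraining.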

Clinical classification
Several classification schemes for KC severity have been reported in the literature [78][79][80][81][82][83][84][85][86]. The Amsler-Krumeich (AK) classification system, one of the earliest, categorizes the severity of KC into four stages. It considers factors such as spectacle refraction, central keratometry, the presence or absence of scars, and central corneal thickness [87]. To improve the classification of disease severity, others have modified and extended this classification [56,88]. Alongside these classification systems, a standardized method for documenting the progression of ectasia is crucial, because the decision to recommend treatments such as corneal cross-linking depends heavily on well-documented ectasia progression in clinical assessments.

Table 4 Primary Flask resources used to build the web interface of the proposed system

Component | Description
Views | A class that allows for the creation of several instances with varying arguments. It can be utilized to modify the behaviour of the view. In the current prototype, this class is connected to the app.route decorator that loads the required data onto a web page and displays it
Routing URLs | Associates URLs with operations such as serving pages or data. The created prototype makes use of static route URLs. However, more sophisticated applications can also make use of dynamic route URLs
Statics | A subdirectory that contains the application's JavaScript and cascading style sheets (CSS). Users can therefore access these files over the Hypertext Transfer Protocol (HTTP) and its secure extension (HTTPS)
Templates | Provides various types of data files, including photos, JavaScript, or cascading style sheets (CSS). It offers static file management [74,75]. The current prototype also makes use of Bootstrap to adjust the webpages to fit different screen sizes
Model | Flask can be used with or without a database. In this prototype, the SQLite database is employed to temporarily store the user inputs and the relevant predictions together with the user's access credentials

The 2015 global consensus published by a committee of expert ophthalmologists [21,89] concluded that "abnormal posterior ectasia, abnormal corneal thickness distribution, and clinical non-inflammatory corneal thinning are mandatory findings to diagnose keratoconus." However, this definition is not easy to implement because the consensus did not specify thresholds or parameters for diagnosing KC or its severity stages, and it therefore remains subject to different interpretations. Duncan et al. [90,91] proposed an ABCD classification system that scores KC severity from 0 to 4. More recently, in response to limitations in the AK system and guided by the global consensus document on KC and ectatic diseases, Belin et al. [92,93] introduced a new ABCD severity staging system. The use of this system with the Pentacam (Oculus GmbH, Wetzlar, Germany) [46] was motivated by the device's high measurement repeatability, which surpasses that of other corneal imaging devices [94].
Each of the reported classification systems provides unique insights into the extent, location, and clinical signs of KC, contributing to a comprehensive evaluation of disease severity. The subjects in the study dataset were therefore graded using a combination of clinical examinations, slit-lamp assessments, and corneal topography data obtained from Pentacam imaging devices, as detailed in the section on pre-processing. The classification results from the ML predictions demonstrated a strong correlation with the clinical classifications. This confirms the validity and effectiveness of the developed ML model.

Feature selection
Experimentation involved a raw clinical dataset comprising 644 subjects (augmented to 900 samples, with 180 samples per class) and 79 feature columns. After several data cleaning steps, the feature columns were reduced to 58. Subsequently, a feature selection process involving feature relative importance and feature dependency analysis was implemented. A combination of expert opinion, probability, and visual methods was employed to narrow the features down to a subset of only three, which represents approximately 3.8% of the features in the raw dataset (3 of 79).
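The class balancing mentioned above (900 samples, 180 per class) can be illustrated with a minimal sketch. Random duplication is used here purely as a stand-in, since this section does not state how the study's augmented samples were generated; the function name and signature are likewise hypothetical.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def oversample_to_target(
    samples: Sequence,
    labels: Sequence[Hashable],
    target_per_class: int,
    seed: int = 0,
) -> Tuple[List, List]:
    """Randomly duplicate samples until every class has `target_per_class` items.

    Plain random oversampling is an illustrative substitute, not the
    augmentation technique used in the study.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    out_x, out_y = [], []
    for y, rows in by_class.items():
        if len(rows) > target_per_class:
            raise ValueError(f"class {y!r} already exceeds the target size")
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(rows) for _ in range(target_per_class - len(rows))]
        for x in rows + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y
```

With five severity classes and a target of 180 samples per class, a function of this kind returns 900 samples in total, matching the balanced dataset size quoted above.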
The significance of this selected feature subset, characterized by high relative importance, was validated through both visual observations (Fig. 7) and the consensus of domain experts. This confirmed the reliability and effectiveness of the implemented pre-processing and feature selection process. The role of each selected feature in the classification of KC severity is outlined as follows:
• Posterior radius of curvature (PRC) in the 3.0 mm zone, represented by Pentacam's Rm_B (mm) parameter. It measures the curvature of the posterior (back) surface of the cornea. This measurement is critical for assessing the shape and structure of the cornea and plays a pivotal role in the assessment of KC severity, which involves structural changes in the posterior corneal surface. In the relative importance analysis presented in Fig. 6, the PRC attained the highest ranking, with a score of 0.938.
• Anterior radius of curvature (ARC) in the 3.0 mm zone, denoted by Pentacam's Rm_F (mm) parameter. It measures the curvature of the anterior (front) surface of the cornea. This measurement is significant in evaluating the shape of the cornea and is frequently considered in the assessment of overall corneal condition, including KC severity. ARC ranked second in the relative importance analysis, with a score of 0.745 (Fig. 6).
• Thinnest pachymetry, measured in µm and represented by Pentacam's Pachy_Min parameter. It gives the minimum corneal thickness, measured at the point called the thinnest location. This measurement is crucial for assessing the severity of KC, where variations in corneal thickness indicate the progression and severity of the condition. In the feature selection analysis, this parameter ranked third, with a score of 0.734 (Fig. 6).
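The relative importance scores quoted above are based on the Gini method (Fig. 6). As a minimal sketch of the underlying quantity, the code below computes the Gini impurity of a set of class labels and the impurity decrease obtained by splitting one feature at a threshold. Tree-based importance measures accumulate such decreases over all splits that use a feature; how the study aggregated or normalised its scores is not reproduced here, and the function names are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def gini(labels: Sequence[Hashable]) -> float:
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(
    values: Sequence[float], labels: Sequence[Hashable], threshold: float
) -> float:
    """Impurity decrease from splitting one feature at `threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    # Child impurities are weighted by the share of samples in each child.
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted
```

A split that separates the classes perfectly removes all impurity, whereas a split that leaves every sample on one side yields a decrease of zero, so features that separate the severity stages well accumulate larger scores.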
Table 5 presents the median values of the selected features; these values correspond to the thresholds specified in Belin's ABCD grading system for the respective severity levels [92,93]. However, Belin's system also considers the best-corrected visual acuity (BCVA) in addition to the features identified in this study. The BCVA is obtained through an optometric refractive examination and is independent of corneal topography. It should also be noted that this set of features is distinct from the subset previously identified in [45] for the classification of normal and KC corneas.

Model classification performance
The clinical dataset employed in this research was gathered and validated by ophthalmologists and underwent [...] Table 6 presents a comparison between the proposed system and state-of-the-art methods, considering various common performance indicators. This comparison also covers the models used, dataset sizes, input data types, and the number of input features (parameters) used.
Compared with the classification outcomes detailed in [101], which achieved a maximum AUC of 88% across multiple severity levels (five classes), the proposed classifier performed better while using only three input features. The proposed system demonstrated high performance, with an overall accuracy of 98.62%, precision of 98.70%, sensitivity of 98.62%, F1-score of 98.66%, and F2-score of 98.64%. For studies that reported multiple models, the model with the best performance characteristics is reported in Table 6.
Additionally, it is imperative to acknowledge the challenge of making direct comparisons, given the absence of a standardized grading system for categorizing KC severity across these studies [21].

The integrated system
A fully functional decision support system for KC severity detection has been developed, successfully deployed, and tested on a web server. This system, which was designed collaboratively with ophthalmologists, is currently undergoing additional testing to evaluate the model's generalizability. Figure 13 shows example test scenarios that represent various severity stages using new data that was not used in the training or validation of the model. At this stage, the design of the graphical user interface remains intentionally simple to facilitate a pilot feasibility and acceptability study of the proposed system as a new diagnostic tool. These steps are considered crucial precursors to addressing challenges in implementing the system in clinical settings. The implementation of the developed decision support system offers significant opportunities to enhance the clinical practice of KC diagnosis by:
• Facilitating the adoption of a standardized and objective diagnostic approach to severity staging by eye-care professionals, thereby reducing variability and ensuring consistency in patient management across different practice settings.
• Increasing accessibility to KC diagnosis and severity staging across multiple eye-care facilities, irrespective of time or location.
• Providing automated analysis and interpretation of corneal curvature and pachymetry indices. This is particularly important in regions where accessing expert ophthalmologists is challenging.
• Relying on measurements obtained from a single corneal imaging device, in contrast with Belin's classification system, where the CDVA is a significant aspect to consider in KC severity staging.
• Assisting ophthalmologists in making informed decisions, particularly in settings where expertise in interpreting advanced diagnostic imaging is limited.
Moreover, deploying the developed application on a web server has not only enhanced its accessibility but also opened up new research possibilities. These include evaluating system performance across dimensions such as latency, stability, and security. The deployment also enables exploration of the feasibility and acceptability of the system as a novel KC severity screening tool in the clinical setting.

Conclusion
The collaboration between ML experts and ophthalmologists plays a crucial role in improving clinical practice. To enhance the KC detection process, we proposed a real-world decision support system for KC staging utilising ML models and a small subset of corneal indices. The created system is the result of close collaboration between ML experts and a team of specialist ophthalmologists. A transparent and responsible approach is adopted to [...] A reliable subset of corneal parameters, comprising curvature and thinnest pachymetry indices, has been identified and used to create a highly efficient ensemble model based on an RF classification algorithm. The use of these features has streamlined the model's structure and considerably reduced its training time, while preserving a high level of prediction accuracy.
The findings demonstrate that ML has a promising role in KC screening and in improving patient care in everyday ophthalmologic practice. To turn the developed system into a practical application, we integrated the developed model into a web application and deployed it on a web application server. The developed system has promising potential as a KC severity screening tool, especially in areas lacking specialist ophthalmologists.
Future improvements for the developed system encompass multiple aspects, including:
• Evaluating the model's generalizability and interpretability.
• Updating the system after the initial deployment to align with newly collected data and the environment.
• Exploring the implementation of advanced ensemble learning techniques to further enhance the resilience and accuracy of KC detection, including severity staging.
• Exploring the feasibility of automating the transfer of corneal measurements from the Pentacam devices to our application to minimise the potential for human error and ensure more accurate and reliable data integration.
• Providing possible treatment options and referral guidelines.
These aspects, among others, constitute ongoing research endeavours of the authors.

Fig. 1 Workflow of the user-system interaction

Fig. 2 Development stages of the proposed staging predictor

Fig. 5 Comparison between real samples (left columns) and augmented samples (right columns) in each stage

Fig. 6 The relative importance of features within the dataset for predicting the severity class labels based on the Gini method (n = 40). Asph_QB, asphericity coefficient (Q value) of the corneal back surface (posterior), where the asphericity Q value refers to the variation in the curvature of the cornea from its center to the periphery; Asph_QF, asphericity coefficient (Q value) of the corneal front surface (anterior); Astig_B (D), central corneal astigmatism (posterior corneal values measured in diopters); Astig_F (D), central corneal astigmatism (anterior corneal values measured in diopters); Axis_B (flat), corneal meridian of the least astigmatic power (posterior); Axis_F (flat), corneal meridian of the least astigmatic power (anterior); CKI, central keratoconus index; D0mm_Pachy to D10mm_Pachy, average pachymetry on concentric rings with radii (0-10 mm) around the corneal thinnest point, respectively; IHA, index of height asymmetry; IHD, index of height decentration; ISV, index of surface variance; IVA, index of vertical asymmetry; KI, keratoconus index; KMax_Seg_Front (D), keratometry of the steepest point (anterior); Num_Ecc_B and Num_Ecc_F, Fourier-based posterior and anterior eccentricity in central 30 degrees, respectively; Pachy_Apex, corneal thickness at the apex; Pachy_Min, thinnest pachymetry (µm); Pachy_Min_Pos_X and Pachy_Min_Pos_Y, x- and y-coordinates of the thinnest location, respectively; Pupil_Pos_X and Pupil_Pos_Y, x- and y-coordinates of the pupil position relative to the corneal apex, respectively; Pachy_Pupil, corneal thickness at the pupil center; Rh_F (mm), central radius in horizontal direction (anterior); Rm_B (mm), curvature radius of the back surface of the cornea (posterior); Rm_F (mm), curvature radius of the front surface of the cornea (anterior); Rs_F (mm), steepest radius (anterior); R_Per_F (mm), average anterior radius of curvature between the 6 mm and 9 mm zone; R_Per_B (mm), average posterior radius of curvature between the 6 mm and 9 mm zone; Rv_B (mm), central radius in vertical direction (posterior); Rv_F (mm), central radius in vertical direction (anterior)

Fig. 7 Pairwise bivariate distributions of the selected features. Rm_B (mm), curvature of the back surface of the cornea (posterior), measured in mm; Rm_F (mm), curvature of the front surface of the cornea (anterior), measured in mm; Pachy_Min, thinnest pachymetry measured in µm

Fig. 11 Fig. 12
Fig. 11 Confusion matrices of the developed classifier models. a Logistic regression; b Support vector machine; c Random forest

Fig. 13 Example test results for corneas at various KC severity stages.a Stage 0; b Stage 1; c Stage 2; d Stage 3; e Stage 4

Table 1 Outline of the implemented data cleaning procedures

Procedure | Description
Data replication | Redundant elements, which share the same value, are dependent on other parameters, or are derived from them, are eliminated from the dataset in close consultation with ophthalmologists
Constant values | Feature columns with constant values are excluded as they lack informative content that could assist the ML model in distinguishing between various disease conditions
Outliers | Data values that significantly deviate from other data elements are filtered out
[...] | [...] features. For instance, numerical values (0-4) are used to replace diagnosis labels indicating severity stages (0-4)
Skew transformation | Raw datasets often exhibit positive skewness (a longer tail to the right) or negative skewness (a longer tail to the left), deviating from a normal distribution. Numerous statistical tests, including ANOVA, F-test, and others, require data to have a normal or near-normal distribution. The current dataset exemplifies such asymmetry, with skew values ranging from 3.33 to −15.47, notably outside the acceptable range of typical statistical tests (+2 to −2) [54]. It is therefore necessary to reduce this skewness, bringing the dataset as close as possible to a normal Gaussian distribution. After experimenting with multiple transformations, including the log, Box-Cox, square root (SQRT) and others, the SQRT was identified as the most suitable method to bring all skewed features within the acceptable range
Feature scaling | Prior to training the proposed models, data normalization is applied to the dataset to mitigate distortions arising from features with disparate scales, facilitating improved interpretation of distance-based approaches. Various methods exist to normalize feature values, ensuring they are measured on a consistent scale. Common techniques include min-max scaling, mean scaling, and standard scaling
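The skew transformation and feature scaling steps in Table 1 can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the skewness estimator shown is the population (moment-based) form, which may differ from the one used in the study, and min-max scaling is picked from the techniques listed in the table without implying that it was the one applied.

```python
import math
from typing import List, Sequence


def skewness(values: Sequence[float]) -> float:
    """Moment-based skewness: third central moment over the SD cubed."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 else 0.0


def sqrt_transform(values: Sequence[float]) -> List[float]:
    """Square-root transform, defined here for non-negative values only."""
    if min(values) < 0:
        raise ValueError("square-root transform requires non-negative values")
    return [math.sqrt(v) for v in values]


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]
```

A square-root transform compresses large values more than small ones, so it reduces a long right tail; checking the skewness before and after the transform shows whether a feature has moved into the acceptable range.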

Table 2 Definitions of keratoconus severity stages [55,56]

Stage | Description
Stage 0 | Clear cornea with normal thickness and corrected distance visual acuity (CDVA) of 6/6
Stage 1 | Clear cornea with the potential presence of Fleischer's ring, mild corneal thinning evident on topography but not grossly, distorted reflex on retinoscopy, and CDVA less than 6/6
Stage 2 | Fleischer's ring and Vogt's striae, corneal thinning may be evident grossly, scissoring reflex on retinoscopy, and CDVA below 6/12
Stage 3 | Initial manifestation of Munson's sign, significant corneal thinning with faint scarring, retinoscopy difficult to perform, spectacle distance visual acuity worse than 6/30, yet with potential improvement to 6/6 with contact lenses
Stage 4 | Corneal scarring and opacities at the apex, evident Munson's sign, retinoscopy impossible to perform, CDVA worse than 6/120 and not achieving 6/6 even with contact lenses

Table 3 Performance comparison of the developed models. LoR = logistic regression; SVM = support vector machines; RF = random forest

Table 5 Median values of the selected features for different severity stages. PRC = posterior radius of curvature; ARC = anterior radius of curvature